Spotify Random Sample Dataset on Users and Song Plays¶

Building a recommendation engine for users based on their listening history¶

In [2]:
import graphlab
In [3]:
song_data = graphlab.SFrame('song_data.gl/')
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1504103810.log

Small snippet of our sample data¶

In [4]:
song_data.head(3)
Out[4]:
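+----------------------------------------------+--------------------+--------------+-----------------+---------------+-------------------------------------+
| user_id                                      | song_id            | listen_count | title           | artist        | song                                |
+----------------------------------------------+--------------------+--------------+-----------------+---------------+-------------------------------------+
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1            | The Cove        | Jack Johnson  | The Cove - Jack Johnson             |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2            | Entre Dos Aguas | Paco De Lucia | Entre Dos Aguas - Paco De Lucia ... |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBXHDL12A81C204C0 | 1            | Stronger        | Kanye West    | Stronger - Kanye West               |
+----------------------------------------------+--------------------+--------------+-----------------+---------------+-------------------------------------+
[3 rows x 6 columns]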
In [5]:
graphlab.canvas.set_target('ipynb')
In [6]:
song_data['song'].show()
In [7]:
print('number of play records in dataset: ',len(song_data))
('number of play records in dataset: ', 1116609)
In [8]:
users = song_data['user_id'].unique()
In [9]:
print('number of unique users: ',len(users))
('number of unique users: ', 66346)

Splitting the dataset into training and test sets¶

Creating a popularity-based recommender¶

Songs are ranked by how many play records they have in the training data, and every user is recommended the same top-ranked songs, minus the songs they have already played.¶

In [10]:
train_data,test_data = song_data.random_split(.8,seed=0)
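A minimal sketch of what a seeded 80/20 row-level split does, written in plain Python rather than GraphLab. The helper `random_split_rows` and the toy rows are made up for illustration, and GraphLab's own sampling will not produce the same partition for the same seed.

```python
import random

def random_split_rows(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Toy (user, song) play records; hypothetical data, not taken from the dataset above.
toy_rows = [("u%d" % (i % 7), "s%d" % (i % 5)) for i in range(1000)]
train_rows, test_rows = random_split_rows(toy_rows, 0.8, seed=0)
print(len(train_rows), len(test_rows))
```

As in the notebook, the split is only approximately 80/20: 893580 of the 1116609 rows (about 80.03%) ended up in the training set.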
In [11]:
popularity_model = graphlab.popularity_recommender.create(train_data,user_id='user_id',item_id='song')
Recsys training: model = popularity
Warning: Ignoring columns song_id, listen_count, title, artist;
    To use one of these as a target column, set target = 
    and use a method that allows the use of a target.
Preparing data set.
    Data has 893580 observations with 66085 users and 9952 items.
    Data prepared in: 0.883059s
893580 observations to process; with 9952 unique items.

Picking a sample user and displaying their top 3 recommendations¶

In [12]:
popularity_model.recommend(users=[users[0]]).head(3)
Out[12]:
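+----------------------------------------------+------------------------------------+--------+------+
| user_id                                      | song                               | score  | rank |
+----------------------------------------------+------------------------------------+--------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sehr kosmisch - Harmonia           | 4754.0 | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Undo - Björk                       | 4227.0 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | You're The One - Dwight Yoakam ... | 3781.0 | 3    |
+----------------------------------------------+------------------------------------+--------+------+
[3 rows x 4 columns]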
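A rough sketch of the idea behind a popularity recommender with no target column, in plain Python: count the play records per song, then recommend the most-counted songs the user has not played yet. The function `popularity_recommend` and the toy data are illustrative assumptions, not GraphLab's implementation.

```python
from collections import Counter

def popularity_recommend(plays, user, k=3):
    """Rank songs by number of play records, skipping songs `user` already played."""
    counts = Counter(song for _, song in plays)
    seen = {song for u, song in plays if u == user}
    ranked = [(song, n) for song, n in counts.most_common() if song not in seen]
    return ranked[:k]

# Toy (user, song) play records; hypothetical data.
toy_plays = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
    ("u4", "D"),
]
print(popularity_recommend(toy_plays, "u4"))
```

Under this reading, the scores in the table above (4754.0, 4227.0, 3781.0) are counts of play records per song, which is why they are whole numbers.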

Building a personalized recommendation model¶

In [24]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')
Recsys training: model = item_similarity
Warning: Ignoring columns song_id, listen_count, title, artist;
    To use one of these as a target column, set target = 
    and use a method that allows the use of a target.
Preparing data set.
    Data has 893580 observations with 66085 users and 9952 items.
    Data prepared in: 0.895472s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 2.787ms                        | 1.5        |
| 28.268ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 227.948ms                           | 0                | 0               |
| 655.791ms                           | 100              | 9952            |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 1.7249s
In [26]:
personalized_model.recommend(users=[users[0]]).head(3)
Out[26]:
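+----------------------------------------------+--------------------------------------------------+-----------------+------+
| user_id                                      | song                                             | score           | rank |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Riot In Cell Block Number Nine - Dr Feelgood ... | 0.0374999940395 | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sei Lá Mangueira - Elizeth Cardoso ...           | 0.0331632643938 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | The Stallion - Ween                              | 0.0322580635548 | 3    |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
[3 rows x 4 columns]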
In [17]:
personalized_model.recommend(users=[users[1]])
Out[17]:
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| user_id                                      | song                                                | score           | rank |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Grind With Me (Explicit Version) - Pretty Ricky ... | 0.0459424433009 | 1    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | There Goes My Baby - Usher ...                      | 0.0333266227603 | 2    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Panty Droppa [Intro] (Album Version) - Trey ...     | 0.0318658401612 | 3    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Nobody (Featuring Athena Cage) (LP Version) - ...   | 0.027853068198  | 4    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Youth Against Fascism - Sonic Youth ...             | 0.0263032036922 | 5    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Nice & Slow - Usher                                 | 0.0239837935781 | 6    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Making Love (Into The Night) - Usher ...            | 0.0238530544409 | 7    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Naked - Marques Houston                             | 0.0228925619283 | 8    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Love Lost (Album Version) - Trey Songz ...          | 0.0228536024205 | 9    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Possessed - Kruiz                                   | 0.0228088947837 | 10   |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]
In [18]:
personalized_model.get_similar_items(['With Or Without You - U2'])
PROGRESS: Getting similar items completed in 0.008224
Out[18]:
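+--------------------------+------------------------------------------------+-----------------+------+
| song                     | similar                                        | score           | rank |
+--------------------------+------------------------------------------------+-----------------+------+
| With Or Without You - U2 | I Still Haven't Found What I'm Looking For ... | 0.0430327868852 | 1    |
| With Or Without You - U2 | Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...  | 0.033734939759  | 2    |
| With Or Without You - U2 | Window In The Skies - U2                       | 0.0328358208955 | 3    |
| With Or Without You - U2 | Vertigo - U2                                   | 0.0300751879699 | 4    |
| With Or Without You - U2 | Sunday Bloody Sunday - U2                      | 0.0271317829457 | 5    |
| With Or Without You - U2 | Bad - U2                                       | 0.0251798561151 | 6    |
| With Or Without You - U2 | A Day Without Me - U2                          | 0.0237154150198 | 7    |
| With Or Without You - U2 | Another Time Another Place - U2 ...            | 0.020325203252  | 8    |
| With Or Without You - U2 | Walk On - U2                                   | 0.020202020202  | 9    |
| With Or Without You - U2 | Get On Your Boots - U2                         | 0.0196850393701 | 10   |
+--------------------------+------------------------------------------------+-----------------+------+
[10 rows x 4 columns]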
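To give a feel for where similarity scores like these can come from, here is a small plain-Python sketch that assumes Jaccard similarity between the sets of users who played each song. Jaccard is believed to be the default for GraphLab's item similarity recommender when no `similarity_type` is passed, but that is an assumption here; the helpers `jaccard` and `similar_items` and the toy data are illustrative only.

```python
from collections import defaultdict

def jaccard(users_a, users_b):
    """Shared listeners divided by all listeners of either song."""
    union = users_a | users_b
    return len(users_a & users_b) / float(len(union)) if union else 0.0

def similar_items(plays, item, k=10):
    """Return up to k (song, similarity) pairs, most similar to `item` first."""
    listeners = defaultdict(set)
    for user, song in plays:
        listeners[song].add(user)
    target = listeners.get(item, set())
    scores = [(other, jaccard(target, users))
              for other, users in listeners.items() if other != item]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by name
    return scores[:k]

# Toy (user, song) play records; hypothetical data.
toy_plays = [
    ("u1", "X"), ("u2", "X"), ("u3", "X"),
    ("u1", "Y"), ("u2", "Y"), ("u4", "Y"),
    ("u3", "Z"),
    ("u5", "W"),
]
print(similar_items(toy_plays, "X"))
```

In the toy data, songs X and Y share two of four distinct listeners, so their similarity is 0.5. Low absolute scores such as 0.043 in the table above would then mean that only a small fraction of the combined audience played both songs.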
In [20]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])
PROGRESS: Getting similar items completed in 0.002774
Out[20]:
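+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| song                                           | similar                                             | score           | rank |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| Chan Chan (Live) - Buena Vista Social Club ... | Murmullo - Buena Vista Social Club ...              | 0.188118811881  | 1    |
| Chan Chan (Live) - Buena Vista Social Club ... | La Bayamesa - Buena Vista Social Club ...           | 0.187192118227  | 2    |
| Chan Chan (Live) - Buena Vista Social Club ... | Amor de Loca Juventud - Buena Vista Social Club ... | 0.184834123223  | 3    |
| Chan Chan (Live) - Buena Vista Social Club ... | Diferente - Gotan Project                           | 0.0214592274678 | 4    |
| Chan Chan (Live) - Buena Vista Social Club ... | Mistica - Orishas                                   | 0.0205761316872 | 5    |
| Chan Chan (Live) - Buena Vista Social Club ... | Hotel California - Gipsy Kings ...                  | 0.019305019305  | 6    |
| Chan Chan (Live) - Buena Vista Social Club ... | Nací Orishas - Orishas                              | 0.0191570881226 | 7    |
| Chan Chan (Live) - Buena Vista Social Club ... | Le Moulin - Yann Tiersen                            | 0.0187969924812 | 8    |
| Chan Chan (Live) - Buena Vista Social Club ... | Gitana - Willie Colon                               | 0.0187969924812 | 9    |
| Chan Chan (Live) - Buena Vista Social Club ... | Criminal - Gotan Project                            | 0.018779342723  | 10   |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]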
In [21]:
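# Parse (major, minor) as integers; comparing version strings directly would misorder "1.10" and "1.6".
gl_version = tuple(int(part) for part in graphlab.version.split('.')[:2])
if gl_version >= (1, 6):
    # Newer API: evaluate both models on a 5% sample of test users, then plot the comparison.
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance, [popularity_model, personalized_model])
else:
    # Older API: run the same evaluation through the legacy helper.
    %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=.05)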
compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 17001.6
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 22365.4

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0255885363357 | 0.00660202699609 |
|   2    | 0.0254179460935 | 0.0131865348344  |
|   3    |  0.024337541226 | 0.0195730015843  |
|   4    | 0.0223473217332 | 0.0237471406868  |
|   5    | 0.0208802456499 | 0.0279004544869  |
|   6    | 0.0195610144433 | 0.0313233195782  |
|   7    | 0.0186187064386 | 0.0347225124093  |
|   8    | 0.0175707949505 | 0.0380009590091  |
|   9    | 0.0167936616248 | 0.0408946127984  |
|   10   | 0.0159672466735 | 0.0431541003445  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 1282.29
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 1277.46

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.185602183555 | 0.0580033551119 |
|   2    |  0.155237120437 | 0.0913773449342 |
|   3    |  0.13590355965  |  0.11735511541  |
|   4    |  0.122824974411 |  0.137502018055 |
|   5    |  0.111838962811 |  0.154550447846 |
|   6    |  0.103093369726 |  0.170952566142 |
|   7    |  0.096602817176 |  0.18653678385  |
|   8    | 0.0893039918117 |  0.196434939449 |
|   9    |  0.084082034952 |  0.206375680122 |
|   10   | 0.0797679972706 |  0.21799005233  |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

Model compare metric: precision_recall
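For readers unfamiliar with the metric, the following plain-Python sketch shows one common way to compute mean precision and recall at a cutoff k: precision is hits divided by k, recall is hits divided by the number of held-out items, and both are averaged over users. The helper names and toy data are hypothetical, and GraphLab's evaluation may differ in details such as how users are sampled.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

def mean_precision_recall(recs_by_user, held_out_by_user, k):
    """Average per-user precision and recall over users with held-out plays."""
    pairs = [precision_recall_at_k(recs_by_user.get(user, []), relevant, k)
             for user, relevant in held_out_by_user.items()]
    n = float(len(pairs))
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n

# Toy ranked recommendations and held-out (test) songs per user; hypothetical data.
toy_recs = {"u1": ["A", "B", "C"], "u2": ["D", "E", "F"]}
toy_held_out = {"u1": {"A", "C"}, "u2": {"X"}}
for k in (1, 3):
    print("cutoff %d: precision=%.3f recall=%.3f"
          % ((k,) + mean_precision_recall(toy_recs, toy_held_out, k)))
```

Precision tends to fall and recall tends to rise as the cutoff grows, which matches the pattern in both tables above.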

Summary¶

Mean precision at cutoff 1 increased from 2.56% with the popularity model (M0) to 18.6% with the personalized model (M1)¶

Mean recall at cutoff 10 increased from 4.32% (M0) to 21.8% (M1)¶

Using an item similarity recommender raised precision at cutoff 1 by more than 7x and recall at cutoff 10 by about 5x¶

As the scores of the predictions show, a model can only be as good as its data.¶

If the data also included explicit sentiment (for example, ratings or likes) to use as a target for score prediction, the recommendation engine would likely produce better results in a real production environment.¶
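The improvement factors can be checked directly from the precision and recall tables above. The variable names below are made up; the numbers are copied from the M0 and M1 tables.

```python
# Values copied from the evaluation tables (M0 = popularity, M1 = item similarity).
m0_precision_at_1 = 0.0255885363357
m1_precision_at_1 = 0.185602183555
m0_recall_at_10 = 0.0431541003445
m1_recall_at_10 = 0.21799005233

precision_gain = m1_precision_at_1 / m0_precision_at_1
recall_gain = m1_recall_at_10 / m0_recall_at_10

print("precision@1 gain: %.2fx" % precision_gain)
print("recall@10 gain: %.2fx" % recall_gain)
```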